AITopics | minority class sample

Collaborating Authors

minority class sample

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Data Balancing Strategies: A Survey of Resampling and Augmentation Methods

Yousefimehr, Behnam, Ghatee, Mehdi, Seifi, Mohammad Amin, Fazli, Javad, Tavakoli, Sajed, Rafei, Zahra, Ghaffari, Shervin, Nikahd, Abolfazl, Gandomani, Mahdi Razi, Orouji, Alireza, Kashani, Ramtin Mahmoudi, Heshmati, Sarina, Mousavi, Negin Sadat

arXiv.org Machine LearningMay-21-2025

Imbalanced data poses a significant obstacle in machine learning, as an unequal distribution of class labels often results in skewed predictions and diminished model accuracy. To mitigate this problem, various resampling strategies have been developed, encompassing both oversampling and undersampling techniques aimed at modifying class proportions. Conventional oversampling approaches like SMOTE enhance the representation of the minority class, whereas undersampling methods focus on trimming down the majority class. Advances in deep learning have facilitated the creation of more complex solutions, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are capable of producing high-quality synthetic examples. This paper reviews a broad spectrum of data balancing methods, classifying them into categories including synthetic oversampling, adaptive techniques, generative models, ensemble-based strategies, hybrid approaches, undersampling, and neighbor-based methods. Furthermore, it highlights current developments in resampling techniques and discusses practical implementations and case studies that validate their effectiveness. The paper concludes by offering perspectives on potential directions for future exploration in this domain.

artificial intelligence, data quality, machine learning, (19 more...)

arXiv.org Machine Learning

2505.13518

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
South America > Uruguay > Maldonado > Maldonado (0.04)
Africa > Mali (0.04)
(6 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Addressing Class Imbalance with Probabilistic Graphical Models and Variational Inference

Lou, Yujia, Liu, Jie, Sheng, Yuan, Wang, Jiawei, Zhang, Yiwei, Ren, Yaokun

arXiv.org Artificial IntelligenceApr-9-2025

This study proposes a method for imbalanced data classification based on deep probabilistic graphical models (DPGMs) to solve the problem that traditional methods have insufficient learning ability for minority class samples. To address the classification bias caused by class imbalance, we introduce variational inference optimization probability modeling, which enables the model to adaptively adjust the representation ability of minority classes and combines the class-aware weight adjustment strategy to enhance the classifier's sensitivity to minority classes. In addition, we combine the adversarial learning mechanism to generate minority class samples in the latent space so that the model can better characterize the category boundary in the high-dimensional feature space. The experiment is evaluated on the Kaggle "Credit Card Fraud Detection" dataset and compared with a variety of advanced imbalanced classification methods (such as GAN-based sampling, BRF, XGBoost-Cost Sensitive, SAAD, HAN). The results show that the method in this study has achieved the best performance in AUC, Precision, Recall and F1-score indicators, effectively improving the recognition rate of minority classes and reducing the false alarm rate. This method can be widely used in imbalanced classification tasks such as financial fraud detection, medical diagnosis, and anomaly detection, providing a new solution for related research.

artificial intelligence, data mining, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2504.05758

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)

Genre: Research Report > New Finding (0.87)

Industry:

Banking & Finance (1.00)
Law Enforcement & Public Safety > Fraud (0.90)
Information Technology > Security & Privacy (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.69)

Add feedback

A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis

Bacha, Soufiane, Ning, Huansheng, Mostefa, Belarbi, Sarwatt, Doreen Sebastian, Dhelim, Sahraoui

arXiv.org Artificial IntelligenceMar-15-2025

Accurate illness diagnosis is vital for effective treatment and patient safety. Machine learning models are widely used for cancer diagnosis based on historical medical data. However, data imbalance remains a major challenge, leading to hindering classifier performance and reliability. The SMOTEBoost method addresses this issue by generating synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary and can produce noisy samples. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. Firstly, RE-SMOTEBoost focuses on generating synthetic samples in overlapping regions to better capture the decision boundary using roulette wheel selection. Secondly, it incorporates a filtering mechanism based on information entropy to reduce noise, and borderline cases and improve the quality of generated data. Thirdly, we introduce a double regularization penalty to control the synthetic samples proximity to the decision boundary and avoid class overlap. These enhancements enable higher-quality oversampling of the minority class, resulting in a more balanced and effective training dataset. The proposed method outperforms existing state-of-the-art techniques when evaluated on imbalanced datasets. Compared to the top-performing sampling algorithms, RE-SMOTEBoost demonstrates a notable improvement of 3.22\% in accuracy and a variance reduction of 88.8\%. These results indicate that the proposed model offers a solid solution for medical settings, effectively overcoming data scarcity and severe imbalance caused by limited samples, data collection difficulties, and privacy constraints.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2503.12239

Country:

Europe > Portugal > Coimbra > Coimbra (0.05)
Asia > China > Beijing > Beijing (0.04)
Africa > Middle East > Algeria > Tiaret Province > Tiaret (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.52)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

A Structured Reasoning Framework for Unbalanced Data Classification Using Probabilistic Models

Du, Junliang, Dou, Shiyu, Yang, Bohuan, Hu, Jiacheng, An, Tai

arXiv.org Artificial IntelligenceFeb-5-2025

This paper studies a Markov network model for unbalanced data, aiming to solve the problems of classification bias and insufficient minority class recognition ability of traditional machine learning models in environments with uneven class distribution. By constructing joint probability distribution and conditional dependency, the model can achieve global modeling and reasoning optimization of sample categories. The study introduced marginal probability estimation and weighted loss optimization strategies, combined with regularization constraints and structured reasoning methods, effectively improving the generalization ability and robustness of the model. In the experimental stage, a real credit card fraud detection dataset was selected and compared with models such as logistic regression, support vector machine, random forest and XGBoost. The experimental results show that the Markov network performs well in indicators such as weighted accuracy, F1 score, and AUC-ROC, significantly outperforming traditional classification models, demonstrating its strong decision-making ability and applicability in unbalanced data scenarios. Future research can focus on efficient model training, structural optimization, and deep learning integration in large-scale unbalanced data environments and promote its wide application in practical applications such as financial risk control, medical diagnosis, and intelligent monitoring.

artificial intelligence, machine learning, markov network, (17 more...)

arXiv.org Artificial Intelligence

2502.03386

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Switzerland (0.04)

Genre: Research Report > New Finding (0.69)

Industry:

Banking & Finance (1.00)
Health & Medicine (0.89)
Law Enforcement & Public Safety > Fraud (0.69)
Information Technology > Security & Privacy (0.48)

Add feedback

A binary PSO based ensemble under-sampling model for rebalancing imbalanced training data

Li, Jinyan, Wu, Yaoyang, Fong, Simon, Tallón-Ballesteros, Antonio J., Yang, Xin-she, Mohammed, Sabah, Wu, Feng

arXiv.org Artificial IntelligenceJan-30-2025

Ensemble technique and under-sampling technique are both effective tools used for imbalanced dataset classification problems. In this paper, a novel ensemble method combining the advantages of both ensemble learning for biasing classifiers and a new under-sampling method is proposed. The under-sampling method is named Binary PSO instance selection; it gathers with ensemble classifiers to find the most suitable length and combination of the majority class samples to build a new dataset with minority class samples. The proposed method adopts multi-objective strategy, and contribution of this method is a notable improvement of the performances of imbalanced classification, and in the meantime guaranteeing a best integrity possible for the original dataset. We experimented the proposed method and compared its performance of processing imbalanced datasets with several other conventional basic ensemble methods. Experiment is also conducted on these imbalanced datasets using an improved version where ensemble classifiers are wrapped in the Binary PSO instance selection. According to experimental results, our proposed methods outperform single ensemble methods, state-of-the-art under-sampling methods, and also combinations of these methods with the traditional PSO instance selection algorithm.

artificial intelligence, evolutionary algorithm, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2502.01655

Country:

Asia > Macao (0.04)
Oceania > Australia (0.04)
North America > United States > District of Columbia > Washington (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
(4 more...)

Add feedback

Enhancing Imbalance Learning: A Novel Slack-Factor Fuzzy SVM Approach

Tanveer, M., Tiwari, Anushka, Akhtar, Mushir, Lin, C. T.

arXiv.org Artificial IntelligenceNov-26-2024

In real-world applications, class-imbalanced datasets pose significant challenges for machine learning algorithms, such as support vector machines (SVMs), particularly in effectively managing imbalance, noise, and outliers. Fuzzy support vector machines (FSVMs) address class imbalance by assigning varying fuzzy memberships to samples; however, their sensitivity to imbalanced datasets can lead to inaccurate assessments. The recently developed slack-factor-based FSVM (SFFSVM) improves traditional FSVMs by using slack factors to adjust fuzzy memberships based on misclassification likelihood, thereby rectifying misclassifications induced by the hyperplane obtained via different error cost (DEC). Building on SFFSVM, we propose an improved slack-factor-based FSVM (ISFFSVM) that introduces a novel location parameter. This novel parameter significantly advances the model by constraining the DEC hyperplane's extension, thereby mitigating the risk of misclassifying minority class samples. It ensures that majority class samples with slack factor scores approaching the location threshold are assigned lower fuzzy memberships, which enhances the model's discrimination capability. Extensive experimentation on a diverse array of real-world KEEL datasets demonstrates that the proposed ISFFSVM consistently achieves higher F1-scores, Matthews correlation coefficients (MCC), and area under the precision-recall curve (AUC-PR) compared to baseline classifiers. Consequently, the introduction of the location parameter, coupled with the slack-factor-based fuzzy membership, enables ISFFSVM to outperform traditional approaches, particularly in scenarios characterized by severe class disparity. The code for the proposed model is available at \url{https://github.com/mtanveer1/ISFFSVM}.

artificial intelligence, class sample, machine learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TETCI.2024.3524718

2411.17128

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > District of Columbia > Washington (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.98)

Add feedback

Enhanced Credit Score Prediction Using Ensemble Deep Learning Model

Xing, Qianwen, Yu, Chang, Huang, Sining, Zheng, Qi, Mu, Xingyu, Sun, Mengying

arXiv.org Artificial IntelligenceNov-12-2024

In contemporary economic society, credit scores are crucial for every participant. A robust credit evaluation system is essential for the profitability of core businesses such as credit cards, loans, and investments for commercial banks and the financial sector. This paper combines high-performance models like XGBoost and LightGBM, already widely used in modern banking systems, with the powerful TabNet model. We have developed a potent model capable of accurately determining credit score levels by integrating Random Forest, XGBoost, and TabNet, and through the stacking technique in ensemble modeling. This approach surpasses the limitations of single models and significantly advances the precise credit score prediction. In the following sections, we will explain the techniques we used and thoroughly validate our approach by comprehensively comparing a series of metrics such as Precision, Recall, F1, and AUC. By integrating Random Forest, XGBoost, and with the TabNet deep learning architecture, these models complement each other, demonstrating exceptionally strong overall performance.

accuracy, dataset, ensemble model, (10 more...)

arXiv.org Artificial Intelligence

doi: 10.23977/jaip.2024.070316

2410.00256

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.05)
North America > United States > Massachusetts > Suffolk County > Boston (0.05)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Banking & Finance > Credit (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

Ratio law: mathematical descriptions for a universal relationship between AI performance and input samples

Kang, Boming, Cui, Qinghua

arXiv.org Artificial IntelligenceNov-1-2024

Artificial intelligence based on machine learning and deep learning has made significant advances in various fields such as protein structure prediction and climate modeling. However, a central challenge remains: the "black box" nature of AI, where precise quantitative relationships between inputs and outputs are often lacking. Here, by analyzing 323 AI models trained to predict human essential proteins, we uncovered a ratio law showing that model performance and the ratio of minority to majority samples can be closely linked by two concise equations. Moreover, we mathematically proved that an AI model achieves its optimal performance on a balanced dataset. More importantly, we next explore whether this finding can further guide us to enhance AI models' performance. Therefore, we divided the imbalanced dataset into several balanced subsets to train base classifiers, and then applied a bagging-based ensemble learning strategy to combine these base models. As a result, the equation-guided strategy substantially improved model performance, with increases of 4.06% and 5.28%, respectively, outperforming traditional dataset balancing techniques. Finally, we confirmed the broad applicability and generalization of these equations using different types of classifiers and 10 additional, diverse binary classification tasks. In summary, this study reveals two equations precisely linking AI's input and output, which could be helpful for unboxing the mysterious "black box" of AI.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2411.00913

Country:

Asia > China > Hubei Province > Wuhan (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.96)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

Overcoming Imbalanced Safety Data Using Extended Accident Triangle

Sun, Kailai, Lan, Tianxiang, Goh, Yang Miang, Huang, Yueng-Hsiang

arXiv.org Machine LearningAug-11-2024

There is growing interest in using safety analytics and machine learning to support the prevention of workplace incidents, especially in high-risk industries like construction and trucking. Although existing safety analytics studies have made remarkable progress, they suffer from imbalanced datasets, a common problem in safety analytics, resulting in prediction inaccuracies. This can lead to management problems, e.g., incorrect resource allocation and improper interventions. To overcome the imbalanced data problem, we extend the theory of accident triangle to claim that the importance of data samples should be based on characteristics such as injury severity, accident frequency, and accident type. Thus, three oversampling methods are proposed based on assigning different weights to samples in the minority class. We find robust improvements among different machine learning algorithms. For the lack of open-source safety datasets, we are sharing three imbalanced datasets, e.g., a 9-year nationwide construction accident record dataset, and their corresponding codes.

accident, accuracy, dataset, (16 more...)

arXiv.org Machine Learning

2408.07094

Country:

Asia > Singapore (0.05)
North America > United States > Oregon (0.04)
South America > Brazil (0.04)

Genre:

Questionnaire & Opinion Survey (0.93)
Research Report > New Finding (0.47)

Industry:

Health & Medicine > Consumer Health (1.00)
Transportation > Ground > Road (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)

Add feedback

Minimum Enclosing Ball Synthetic Minority Oversampling Technique from a Geometric Perspective

Shangguan, Yi-Yang, Chen, Shi-Shun, Li, Xiao-Yang

arXiv.org Artificial IntelligenceAug-6-2024

Class imbalance refers to the significant difference in the number of samples from different classes within a dataset, making it challenging to identify minority class samples correctly. This issue is prevalent in real-world classification tasks, such as software defect prediction, medical diagnosis, and fraud detection. The synthetic minority oversampling technique (SMOTE) is widely used to address class imbalance issue, which is based on interpolation between randomly selected minority class samples and their neighbors. However, traditional SMOTE and most of its variants only interpolate between existing samples, which may be affected by noise samples in some cases and synthesize samples that lack diversity. To overcome these shortcomings, this paper proposes the Minimum Enclosing Ball SMOTE (MEB-SMOTE) method from a geometry perspective. Specifically, MEB is innovatively introduced into the oversampling method to construct a representative point. Then, high-quality samples are synthesized by interpolation between this representative point and the existing samples. The rationale behind constructing a representative point is discussed, demonstrating that the center of MEB is more suitable as the representative point. To exhibit the superiority of MEB-SMOTE, experiments are conducted on 15 real-world imbalanced datasets. The results indicate that MEB-SMOTE can effectively improve the classification performance on imbalanced datasets.

class sample, minority class sample, representative point, (15 more...)

arXiv.org Artificial Intelligence

2408.03526

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine (0.48)
Law Enforcement & Public Safety > Fraud (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)

Add feedback